Explaining Data-Driven Document Classifications

Authors

  • David Martens
  • Foster J. Provost
Abstract

Many document classification applications require human understanding of the reasons for data-driven classification decisions: by managers, client-facing employees, and the technical team. Predictive models treat documents as data to be classified, and document data are characterized by very high dimensionality, often with tens of thousands to millions of variables (words). Unfortunately, due to the high dimensionality, understanding the decisions made by document classifiers is very difficult. This paper begins by extending the most relevant prior theoretical model of explanations for intelligent systems to account for some missing elements. The main theoretical contribution of the work is the definition of a new sort of explanation as a minimal set of words (terms, more generally), such that removing all words within this set from the document changes the predicted class from the class of interest. We present an algorithm to find such explanations, as well as a framework to assess such an algorithm’s performance. We demonstrate the value of the new approach with a case study from a real-world document classification task: classifying web pages as containing objectionable content, with the goal of allowing advertisers to choose not to have their ads appear there. A second empirical demonstration on news-story topic classification uses the 20 Newsgroups benchmark dataset. The results show the explanations to be concise and document-specific, and to be capable of providing better understanding of the exact reasons for the classification decisions, of the workings of the classification models, and of the business application itself. We also illustrate how explaining documents’ classifications can help to improve data quality and model performance.
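The abstract's definition of an explanation (a minimal set of words whose removal changes the predicted class) can be illustrated with a small sketch. It is not the algorithm from the paper: the toy linear classifier, the word weights, the bias, and the function names `predict` and `minimal_explanation` are all hypothetical, and the search is a plain brute-force scan over word subsets of increasing size, which is only practical for very short documents.

```python
from itertools import combinations

# Hypothetical word weights for a toy linear classifier (not from the paper).
WEIGHTS = {"casino": 2.0, "poker": 1.5, "bonus": 0.8, "weather": -1.0, "news": -0.5}
BIAS = -1.0


def predict(words):
    """Return 1 (class of interest) if the linear score is positive, else 0."""
    score = BIAS + sum(WEIGHTS.get(w, 0.0) for w in set(words))
    return 1 if score > 0 else 0


def minimal_explanation(words, max_size=5):
    """Return a smallest set of words whose removal changes the predicted class.

    Subsets are tried in order of increasing size, so the first hit has
    minimum cardinality. Returns None if no subset up to max_size works.
    """
    vocab = sorted(set(words))
    original = predict(vocab)
    for size in range(1, min(max_size, len(vocab)) + 1):
        for subset in combinations(vocab, size):
            remaining = [w for w in vocab if w not in subset]
            if predict(remaining) != original:
                return set(subset)
    return None


# Toy document, assumed for illustration only.
doc = ["casino", "poker", "bonus", "weather", "today"]
explanation = minimal_explanation(doc)
print(explanation)
```

In this toy case no single word flips the prediction, so the smallest explanation has two words. Exhaustive search grows combinatorially with document length, so a realistic implementation would need a heuristic search over candidate words.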

Related resources

Explaining the Components of Moral Education of Learners and Analyzing its Position in Fundamental Reform Document of Education

The aim of this study was to identify the components of moral education of learners in the higher-tier documents of the country’s educational system and to determine the level of attention paid to these components in the Fundamental Reform Document of Education. In this research, a qualitative-deductive content analysis method was utilized. For this purpose, the unit of analysis was all the sen...

Examination of Vroom’s motivational theory: A new marketing strategy in consumers of online document delivery services: Case study of Shahid Chamran University of Ahvaz

This study aimed to identify and test the expectancy motivational model as a theoretical framework for explaining what motivates information consumers to select and use the document delivery services of Shahid Chamran University, Ahvaz. An explanatory survey method was used. In order to test the hypotheses and analysis of model’s data, covariance structural...

Explaining the hindering factors for accepting the disease, a barrier for the HIV/AIDS patients to seek treatment: A qualitative study

Background and Purpose: When HIV/AIDS patients do not accept their disease, the consequences for the patients and their treatment process are serious. This study aims to explain the factors that hinder HIV/AIDS patients from accepting the disease and thereby act as a barrier to seeking treatment. Methods: This research is part of a grounded theory study. The data were collected through semi-structured intervie...

Learning for Question Answering and Text Classification: Integrating Knowledge-Based and Statistical Techniques

It is a time consuming and difficult task for an individual, a group, or an organization to classify large collections of documents under a content-driven taxonomy. In this paper, we outline an approach for building a system which makes the classification process the responsibility of the author of the document, thus allowing the author to explain classifications and verify (or correct) automat...

Explaining Heterogeneity in Risk Preferences Using a Finite Mixture Model

This paper studies the effect of the space (distance) between lotteries' outcomes on risk-taking behavior and the shape of estimated utility and probability weighting functions. Previously investigated experimental data shows a significant space effect in the gain domain. As compared to low spaced lotteries, high spaced lotteries are associated with higher risk aversion for high probabilities o...

Journal:
  • MIS Quarterly

Volume 38  Issue

Pages  -

Publication date: 2014